import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
import theano.tensor as tt
from data import load_baseball
from utils import despine, ECDF, despine_traceplot
import arviz as az
# For deterministic reproducibility.
np.random.seed(42)
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
Baseball players have many measured performance metrics. Suppose we are on a baseball team and would like to quantify player performance; one such metric is the batting average, defined as the number of times a batter hit a pitched ball divided by the number of times they were up for batting ("at bat"). How would you go about this task?
Discuss:
Let's read in the data.
df = load_baseball()
Before modelling, let's recap the distributions we will use:

- Bernoulli: a single success/failure trial, parameterized by p, which indicates the probability of "success".
- Binomial: the number of successes in n trials. Parameterized by both n and p.
- Beta: a distribution over the unit interval, often used to model p in a Bernoulli or Binomial. Parameterized by $\alpha$ and $\beta$, which can be thought of as "number of successes" and "number of failures" respectively.

Every distribution has its "story". If you're curious, check out Justin Bois' probability stories page.
Let's say we wanted to model a probability distribution centered approximately on 0.2. Depending on our parameterization of the Beta distribution, we can express different levels of confidence (as measured by the spread of the distribution) in how sure we are that the parameter takes on that value.
The code + chart below should illustrate this clearly.
# Beta(a, b) can be read as encoding a - 1 prior successes and b - 1 prior
# failures, hence the subtraction of 1 from (2, 8), (20, 80), and (200, 800).
low_prior = np.random.beta(a=1, b=7, size=100000)
med_prior = np.random.beta(a=19, b=79, size=100000)
high_prior = np.random.beta(a=199, b=799, size=100000)
fig = plt.figure()
ax = fig.add_subplot(111)
labels = ['low', 'med', 'high']
priors = [low_prior, med_prior, high_prior]
for label, prior in zip(labels, priors):
    x, y = ECDF(prior)
    ax.plot(x, y, label=label)
ax.legend()
ax.set_xlim(0, 1)
ax.set_xlabel('probability parameter')
ax.set_ylabel('cumulative fraction')
despine(ax)
Notice how increasing the scale of the parameterization (1x, 10x, 100x) decreased the variance. This leads to another way of looking at the Beta distribution: $\alpha$ and $\beta$ act like pseudo-counts of successes and failures, so the more pseudo-observations we encode, the more confident the prior.
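We can check this quantitatively: the variance of a $\text{Beta}(\alpha, \beta)$ distribution is $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ in closed form. The helper below (a small sketch, not part of this notebook's dependencies) evaluates it for the three priors above.

```python
def beta_variance(a, b):
    """Closed-form variance of a Beta(a, b) distribution."""
    return (a * b) / ((a + b) ** 2 * (a + b + 1))

for a, b in [(1, 7), (19, 79), (199, 799)]:
    print(f"Beta({a}, {b}): mean={a / (a + b):.4f}, var={beta_variance(a, b):.6f}")
```

Each 10x scaling of the parameters shrinks the variance by roughly a factor of 10, matching the progressively steeper ECDFs in the plot.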
We will see this in action below!
Exercise: Write a naive estimation model for the players above.
Hint: a possible model you could specify is as follows.

with pm.Model() as baseline_model:
    thetas = pm.Beta("thetas", alpha=0.5, beta=0.5, shape=len(df))
    like = pm.Binomial("likelihood", n=df["AB"], p=thetas, observed=df["H"])
Now, sample from the posterior.
with baseline_model:
    baseline_trace = pm.sample(2000)
Plot the traceplots to check for convergence.
traceplot = az.plot_trace(baseline_trace)
despine_traceplot(traceplot)
Visualize the posterior distribution using forest plots.
ylabels = "AB: " + df['AB'].astype(str) + ', H: ' + df['H'].astype(str)
az.plot_forest(baseline_trace)
Discuss: Are the estimates reasonable, particularly for players that have had only one at bat (AB)?
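One way to reason about this question without the sampler: the Beta prior is conjugate to the Binomial likelihood, so each player's posterior is available in closed form as $\text{Beta}(H + 0.5, AB - H + 0.5)$, with mean $(H + 0.5) / (AB + 1)$. The sketch below (independent of the trace above) shows what the naive model implies for a single-at-bat player.

```python
def posterior_mean(hits, at_bats, alpha=0.5, beta=0.5):
    """Posterior mean under a Beta(alpha, beta) prior and Binomial likelihood.

    The posterior is Beta(hits + alpha, at_bats - hits + beta),
    whose mean is (hits + alpha) / (at_bats + alpha + beta).
    """
    return (hits + alpha) / (at_bats + alpha + beta)

# A player with a single at-bat and no hits: the MLE is 0.0, but the
# weak prior pulls the estimate to 0.25.
print(posterior_mean(0, 1))  # 0.25
# A player with a single at-bat and one hit: the MLE is 1.0.
print(posterior_mean(1, 1))  # 0.75
```

Neither 0.25 nor 0.75 is a plausible career batting average, which motivates the hierarchical model below: the prior itself should be informed by the population of players.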
Discuss:
If it is qualitatively (and maybe also quantitatively) justifiable, we can impose the following assumption on the modelling process.
We can assume that there is some underlying distribution for a player's batting average. We can then assume that each new player draws his own batting average from the underlying distribution. This encodes our intuition that human performance falls on some continuum that has a region of "concentrated" average. (Think "normal-like" distribution.)
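This generative story can be sketched directly in NumPy (the population parameters here are hypothetical, chosen only to centre the distribution near a 0.2 batting average):

```python
import numpy as np

rng = np.random.default_rng(42)
n_players = 18

# Each player draws a "true" batting average from a shared population-level
# Beta distribution, then their hits are Binomial given that average.
true_thetas = rng.beta(a=20, b=80, size=n_players)
at_bats = rng.integers(1, 500, size=n_players)
hits = rng.binomial(n=at_bats, p=true_thetas)
```

The hierarchical model below inverts this story: given observed `hits` and `at_bats`, it infers both the per-player `thetas` and the population-level parameters jointly.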
The modelling choices for the hierarchical model below are detailed in the PyMC3 docs, but are copied below for convenience.
We will assume that there exists a hidden factor (phi) related to the expected performance for all players (not limited to our 18). Since the population mean is an unknown value between 0 and 1, it must be bounded from below and above. Also, we assume that nothing is known about the global average; hence, a natural choice for a prior distribution is the uniform distribution.
Next, we introduce a hyperparameter kappa to account for the variance in the population batting averages, for which we will use a bounded Pareto distribution. This will ensure that the estimated value falls within reasonable bounds. These hyperparameters will be, in turn, used to parameterize a beta distribution, which is ideal for modeling quantities on the unit interval. The beta distribution is typically parameterized via two shape parameters, but it may also be parameterized in terms of its mean $\mu \in [0, 1]$ and "sample size" (a proxy for the inverse of the variance) $\nu = \alpha + \beta$ ($\nu > 0$).
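Converting between the two parameterizations is a one-liner: given mean $\mu$ and sample size $\nu$, the shape parameters are $\alpha = \mu\nu$ and $\beta = (1 - \mu)\nu$. This is exactly the `alpha=phi*kappa, beta=(1.0-phi)*kappa` construction used in the model below.

```python
def to_shape_params(mu, nu):
    """Convert a Beta distribution's (mean, sample size) to (alpha, beta)."""
    return mu * nu, (1 - mu) * nu

# A mean of 0.2 with 100 pseudo-observations:
alpha, beta = to_shape_params(mu=0.2, nu=100)
print(alpha, beta)  # 20.0 80.0
```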
The final step is to specify a sampling distribution for the data (hit or miss) for every player, using a Binomial distribution. This is where the data are brought to bear on the model.
We could use `pm.Pareto('kappa', m=1.5)` to define our prior on kappa, but the Pareto distribution has very long tails. Exploring these properly is difficult for the sampler, so we use an equivalent but faster parameterization via the exponential distribution, exploiting the fact that the log of a Pareto-distributed random variable follows an exponential distribution.
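We can verify this fact numerically: if $X \sim \text{Exponential}(\lambda)$, then $P(e^X > y) = P(X > \log y) = e^{-\lambda \log y} = y^{-\lambda}$, which is the survival function of a $\text{Pareto}(\alpha = \lambda, m = 1)$. A quick check against the Pareto median (a standalone sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.5

# Exponentiating Exponential(lam) draws should yield Pareto(alpha=lam, m=1).
samples = np.exp(rng.exponential(scale=1 / lam, size=200_000))

# The median of a Pareto(alpha, m=1) is 2 ** (1 / alpha) ~= 1.587 here.
print(np.median(samples))
```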
The most important thing to note is how the $\theta$ parameters are drawn from a common parent distribution whose hyperparameters ($\phi$ and $\kappa$) are shared, rather than broadcast, across the players.
Notes:
In visual form, this is the model below.

with pm.Model() as baseball_model:
    phi = pm.Uniform('phi', lower=0.0, upper=1.0)

    kappa_log = pm.Exponential('kappa_log', lam=1.5)
    kappa = pm.Deterministic('kappa', tt.exp(kappa_log))

    thetas = pm.Beta('thetas', alpha=phi * kappa, beta=(1.0 - phi) * kappa, shape=len(df))
    like = pm.Binomial('like', n=df['AB'], p=thetas, observed=df['H'])
Sample from the posterior.
with baseball_model:
    trace = pm.sample(2000, target_accept=0.95)
Visualize the trace plots to check for convergence.
ax_arr = az.plot_trace(trace)
despine_traceplot(ax_arr)
Visualize the posterior distributions using forest plots.
az.plot_forest(trace, var_names=['thetas'])
With a hierarchical model, we make the assumption that our observations (or treatments that group our observations) are somehow related. Under this assumption, when we have a new sample for which we have very few observations, we are able to borrow power from the population to make inferences about the new sample.
Depending on the scenario, this assumption can either be reasonable, thereby not necessitating much debate, or be considered a "strong assumption", thereby requiring strong justification.
"Shrinkage" is a term used to describe how hierarchical model estimation will usually result in parameter estimates that are "shrunk" away from their maximum likelihood estimators (i.e. the naive estimate from the data) towards the global mean.
Shrinkage in and of itself is not necessarily a good or bad thing. However, because hierarchical models can sometimes be tricky to get right, we can use a shrinkage plot as a visual diagnostic for whether we have implemented the model correctly.
# MLE per player
no_pool = df['batting_avg'].values
# Hierarchical model
partial_pool = trace['thetas'].mean(axis=0)
# MLE over all players
complete_pool = np.array([df['batting_avg'].mean()] * len(df))
fig = plt.figure()
ax = fig.add_subplot(111)
for r in np.vstack([no_pool, partial_pool, complete_pool]).T:
    ax.plot(r)
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(['no pooling', 'partial pooling', 'complete pooling'])
ax.set_ylim(0, 1)
despine(ax)
This model implementation is taken from the PyMC3 docs.
The data come from the Baseball Data Bank, and are redistributed with this repository for educational purposes.